 spam message


A Persuasion-Based Prompt Learning Approach to Improve Smishing Detection through Data Augmentation

Shim, Ho Sung, Park, Hyoungjun, Lee, Kyuhan, Park, Jang-Sun, Kang, Seonhye

arXiv.org Artificial Intelligence

Smishing, which aims to illicitly obtain personal information from unsuspecting victims, holds significance due to its negative impacts on our society. In prior studies, as a tool to counteract smishing, machine learning (ML) has been widely adopted, which filters and blocks smishing messages before they reach potential victims. However, a number of challenges remain in ML-based smishing detection, with the scarcity of annotated datasets being one major hurdle. Specifically, given the sensitive nature of smishing-related data, there is a lack of publicly accessible data that can be used for training and evaluating ML models. Additionally, the nuanced similarities between smishing messages and other types of social engineering attacks such as spam messages exacerbate the challenge of smishing classification with limited resources. To tackle this challenge, we introduce a novel data augmentation method utilizing a few-shot prompt learning approach. What sets our approach apart from extant methods is the use of the principles of persuasion, a psychology theory which explains the underlying mechanisms of smishing. By designing prompts grounded in the persuasion principles, our augmented dataset could effectively capture various important aspects of smishing messages, enabling ML models to be effectively trained. Our evaluation within a real-world context demonstrates that our augmentation approach produces more diverse and higher-quality smishing data instances compared to other cutting-edge approaches, leading to substantial improvements in the ability of ML models to detect the subtle characteristics of smishing messages. Moreover, our additional analyses reveal that the performance improvement provided by our approach is more pronounced when used with ML models that have a larger number of parameters, demonstrating its effectiveness in training large-scale ML models.


SMS Spam Detection and Classification to Combat Abuse in Telephone Networks Using Natural Language Processing

Oyeyemi, Dare Azeez, Ojo, Adebola K.

arXiv.org Artificial Intelligence

In the modern era, mobile phones have become ubiquitous, and Short Message Service (SMS) has grown to become a multi-million-dollar service due to the widespread adoption of mobile devices and the millions of people who use SMS daily. However, SMS spam has also become a pervasive problem that endangers users' privacy and security through phishing and fraud. Despite numerous spam filtering techniques, there is still a need for a more effective solution to address this problem [1]. This research addresses the pervasive issue of SMS spam, which poses threats to users' privacy and security. Despite existing spam filtering techniques, the high false-positive rate persists as a challenge. The study introduces a novel approach utilizing Natural Language Processing (NLP) and machine learning models, particularly BERT (Bidirectional Encoder Representations from Transformers), for SMS spam detection and classification. Data preprocessing techniques, such as stop word removal and tokenization, are applied, along with feature extraction using BERT. Machine learning models, including SVM, Logistic Regression, Naive Bayes, Gradient Boosting, and Random Forest, are integrated with BERT for differentiating spam from ham messages. Evaluation results revealed that the Naïve Bayes classifier + BERT model achieves the highest accuracy at 97.31% with the fastest execution time of 0.3 seconds on the test dataset. This approach demonstrates a notable enhancement in spam detection efficiency and a low false-positive rate. The developed model presents a valuable solution to combat SMS spam, ensuring faster and more accurate detection. This model not only safeguards users' privacy but also assists network providers in effectively identifying and blocking SMS spam messages.
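The pipeline this abstract outlines (tokenization, stop word removal, feature extraction, then a classifier) can be illustrated with a minimal sketch. This is not the authors' method: it swaps BERT features for plain word counts so that it runs with the standard library alone, and the stop word list, the `tokenize` helper, the `NaiveBayes` class, and the four training messages are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop word list (an assumption, not from the paper).
STOPWORDS = {"a", "the", "to", "is", "you", "your", "for", "at"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,!?\"'") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]

class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            # Denominator for smoothed word likelihoods in this class.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical toy training set.
texts = [
    "Win a free prize now",
    "Claim your free cash prize",
    "Are you coming to lunch",
    "See you at the meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
```

Calling `model.predict("Free prize waiting for you")` scores the message against both classes and returns the more likely label. Replacing the word counts with BERT embeddings, as the study does, would change the feature extraction step but not the overall structure.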


SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection

Li, Yekai, Zhang, Rufan, Rong, Wenxin, Mi, Xianghang

arXiv.org Artificial Intelligence

In this study, we introduce SpamDam, an SMS spam detection framework designed to overcome key challenges in detecting and understanding SMS spam, such as the lack of public SMS spam datasets, increasing privacy concerns of collecting SMS data, and the need for adversary-resistant detection models. SpamDam comprises four innovative modules: an SMS spam radar that identifies spam messages from online social networks (OSNs); an SMS spam inspector for statistical analysis; SMS spam detectors (SSDs) that enable both central training and federated learning; and an SSD analyzer that evaluates model resistance against adversaries in realistic scenarios. Leveraging SpamDam, we have compiled over 76K SMS spam messages from Twitter and Weibo between 2018 and 2023, forming the largest dataset of its kind. This dataset has enabled new insights into recent spam campaigns and the training of high-performing binary and multi-label classifiers for spam detection. Furthermore, the effectiveness of federated learning has been well demonstrated to enable privacy-preserving SMS spam detection. Additionally, we have rigorously tested the adversarial robustness of SMS spam detection models, introducing the novel reverse backdoor attack, which has shown effectiveness and stealthiness in practical tests.


Sampling Audit Evidence Using a Naive Bayes Classifier

Sheu, Guang-Yih, Liu, Nai-Ru

arXiv.org Artificial Intelligence

Taiwan's auditors have suffered from processing excessive audit data, including drawing audit evidence. This study advances sampling techniques by integrating machine learning with sampling. This machine learning integration helps avoid sampling bias, preserve randomness and variability, and target riskier samples. We first classify data using a Naive Bayes classifier into some classes. Next, a user-based, item-based, or hybrid approach is employed to draw audit evidence. The representativeness index is the primary metric for measuring its representativeness. The user-based approach samples data symmetric around the median of a class as audit evidence. It may be equivalent to a combination of monetary and variable samplings. The item-based approach represents asymmetric sampling based on posterior probabilities for obtaining risky samples as audit evidence. It may be identical to a combination of non-statistical and monetary samplings. Auditors can hybridize those user-based and item-based approaches to balance representativeness and riskiness in selecting audit evidence. Three experiments show that sampling using machine learning integration has the benefits of drawing unbiased samples, handling complex patterns, correlations, and unstructured data, and improving efficiency in sampling big data. However, the limitations are the classification accuracy output by machine learning algorithms and the range of prior probabilities.


NLP Techniques being Helpful for Spam Detection

#artificialintelligence

NLP techniques are used to prepare data for training spam detectors. In today's multimedia-driven world, gathering information and connecting with people has become extremely easy thanks to social media and the internet. As a result, we receive hundreds of messages and emails daily, many of which are unwanted. These unwanted messages are called spam, and the useful ones are called ham. Today we shall see how spam filtering with Natural Language Processing (NLP) is applied to data to produce the classified data used to train models that detect spam messages.


What role does AI play in cybersecurity?

#artificialintelligence

Many believe that cybersecurity is an exciting field to work in, and indeed it is. Yet being responsible for an organization's IT Security is no easy feat. Attackers always seem to be a few steps ahead of defenders. It often feels like a game of one against many – from petty criminals to nation-states. It would be highly advantageous if our cybersecurity tools could automatically adapt to these threats.


Spam Email Detection Using Machine Learning

#artificialintelligence

There are 4,825 ham and 747 spam messages. This indicates that the data is imbalanced, which needs to be fixed. The top ham message is "Sorry, I'll call later", whereas the top spam message is "Please call our customer service…"; they occurred 30 and 4 times, respectively. First, let's create separate dataframes for ham and spam messages and convert each to a NumPy array and then to a list to generate a WordCloud later. Since this is text data, it contains many unnecessary stopwords, such as articles and prepositions, which need to be removed.
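The checks described here (class balance, most frequent message per class, stopword removal) can be sketched in plain Python. This is an illustrative sketch only: the article works with dataframes, while the toy `messages` list, the small `STOPWORDS` set, and the helper names below are hypothetical stand-ins. Only the counts 4,825 and 747 and the ham message text come from the article.

```python
from collections import Counter

# Hypothetical toy sample standing in for the article's dataset.
messages = [
    ("ham", "Sorry, I'll call later"),
    ("ham", "Sorry, I'll call later"),
    ("ham", "See you at the office"),
    ("spam", "Win a free prize now"),
]

# Hypothetical minimal stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "at", "of", "in", "i'll", "you"}

def class_counts(rows):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in rows)

def top_message(rows, label):
    """Return (text, count) for the most frequent message with this label."""
    counts = Counter(text for lab, text in rows if lab == label)
    return counts.most_common(1)[0]

def remove_stopwords(text):
    """Lowercase, treat commas as spaces, split, and drop stopwords."""
    tokens = text.lower().replace(",", " ").split()
    return [t for t in tokens if t not in STOPWORDS]

def imbalance_ratio(n_majority, n_minority):
    """Ratio of majority-class to minority-class size."""
    return n_majority / n_minority

# Counts reported in the article: 4,825 ham versus 747 spam.
ratio = imbalance_ratio(4825, 747)
```

With the article's counts, `ratio` is roughly 6.5 ham messages per spam message, which is why the imbalance needs to be addressed before training.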


Understanding Naïve Bayes and Support Vector Machine and their implementation in Python

#artificialintelligence

This article was published as a part of the Data Science Blogathon. In this digital world, spam is the most troublesome challenge that everyone is facing. Sending spam messages to people causes various problems that may, in turn, cause economic losses. By spamming messages, we lose memory space, computing power, and speed. To remove these spam messages, we need to spend our time.


Top 20 Dataset in Machine Learning

#artificialintelligence

A dataset is one of the main components in building a machine learning model. Before we start with any algorithm, we need to have a proper understanding of the data. These machine learning datasets are basically used for research purposes. Most of the datasets are homogeneous in nature. We use a dataset to train and evaluate our model, and it plays a vital role in the whole process. If our dataset is structured, less noisy, and properly cleaned, then our model will give good accuracy at evaluation time. The ImageNet dataset was created by a group of researchers, and its images are organized according to the WordNet hierarchy. This dataset can be used for machine learning as well as computer vision research.


How Machine Learning in Search Works: Everything You Need to Know

#artificialintelligence

In the world of SEO, it's important to understand the system you're optimizing for. Another crucial area to understand is machine learning. Now, the term "machine learning" gets thrown around a lot these days. But how does machine learning actually impact search and SEO? This chapter will explore everything you need to know about how search engines use machine learning.